Enthalpy of Solvation
Researchers: Brandon J Jaquis, Andrew SID Lang, Ailin Li, Nolan D Monnier
License: CC0
Data
We extracted the experimentally measured enthalpy of solvation in ethanol values for 144 compounds from the Acree Enthalpy of Solvation Dataset [1]. Ferrocene was not considered because it is an organometallic complex.
Descriptors
We used Rajarshi Guha's CDK Descriptor Calculator (v 1.4.8) to calculate the CDK descriptors for all the compounds in our data file. We selected the option to add explicit hydrogens and exported the results as a .csv file: CDKDescriptors.csv.
Once the descriptors were calculated, we deleted all columns with zero standard deviation, starting with columns consisting entirely of "NA" values, using the following code.
setwd("C://..")
mydata = read.csv(file="CDKDescriptors.csv",header=TRUE,row.names="molID")
ncol(mydata) #get number of columns
[output]
287
[output]
mydata <- mydata[,colSums(is.na(mydata))<nrow(mydata)] ##remove columns of all entries "NA"
ncol(mydata)
[output]
222
[output]
write.csv(mydata,file="CDKDescriptorsNoNA.csv")
Here's the resulting file: CDKDescriptorsNoNA.csv. Then we used the caret package to remove zero- and near-zero-variance columns, i.e., those where (almost) all values are the same.
library(caret) ##for feature selection
nzv <-nearZeroVar(mydata) ##remove zeros and other small variance columns
mydata <- mydata[, -nzv]
ncol(mydata)
[output]
120
[output]
write.csv(mydata,file="DescriptorsNoNZV.csv")
Here's the resulting file: DescriptorsNoNZV.csv. Opening the resulting file, we see that there are still a few "NA" values. Six rows (molID 57-62: helium, neon, argon, xenon, radon, and hydrogen) were removed because of multiple "NA" values, as were three columns with multiple "NA" values (HybRatio, Kier3, topoShape). This leaves us with a set of 138 deltaH values and 116 CDK descriptors: DescriptorsNoNZVCleaned.csv.
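The rows with missing values were removed by hand, but the same clean-up can be done programmatically with complete.cases. The tiny data frame below is a made-up stand-in for the descriptor file, just to illustrate the idea; the values are not real data.

```r
# Illustrative sketch: drop rows containing any "NA" descriptor value.
# 'df' is a hypothetical stand-in, not the real descriptor table.
df <- data.frame(deltaH   = c(-20.1, -12.3, -35.4),
                 HybRatio = c(0.50, NA, 0.70),
                 nHBDon   = c(1, 0, 2))
clean <- df[complete.cases(df), ]
nrow(clean)  # 2: the row with the missing HybRatio is gone
```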
Feature Selection
A correlation matrix was built and used to eliminate highly correlated columns. For each pair of highly correlated columns, we removed the one with the least uniqueness, i.e., the one with the higher average correlation to the remaining columns.
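The selection rule can be sketched in base R on toy data: within a pair of near-duplicate columns, the member that is on average more correlated with everything else (the less unique one) is dropped. The toy variables and the implied cutoff here are illustrative assumptions, not the actual data or threshold used.

```r
# Toy sketch of the correlation-based filter (illustrative only).
set.seed(1)
a <- rnorm(50)
toy <- data.frame(a = a, b = a + rnorm(50, sd = 0.01), c = rnorm(50))
cm <- abs(cor(toy))
diag(cm) <- 0
# locate the most correlated pair (a and b here, r close to 1)
pair <- which(cm == max(cm), arr.ind = TRUE)[1, ]
# drop whichever member of the pair is, on average, more correlated
# with the other columns, i.e. the less unique one
drop <- pair[which.max(colMeans(cm)[pair])]
filtered <- toy[, -drop]
ncol(filtered)  # 2: one of the redundant pair is gone, c survives
```

The caret package automates the same heuristic via findCorrelation.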
Here is the resulting file: SolventInformationFeatureSelected.csv
Chemical Space
As in the previous year, the data contains 138 molecules with 86 descriptors, showing a high level of consistency between the two chemical spaces. Accordingly, the histogram of molecular weights is also the same:
To better visualize the chemical space, the data set is projected using PCA. We also clustered the data to get a better understanding of the data spread.
library(fpc)
library(cluster)
mydata <- read.csv(file="SolventInformationFeatureSelected.csv", header=TRUE, row.names="molID")
#PCA
pc <- prcomp(mydata, scale. = T)
p <- pc$x
summary(pc)
[output]
Importance of components:
                          PC1    PC2    PC3     PC4     PC5     PC6     PC7     PC8    PC9
Standard deviation     4.8857 3.4874 2.9531 2.68900 2.36460 2.04054 1.80613 1.64269 1.5379
Proportion of Variance 0.2776 0.1414 0.1024 0.08408 0.06502 0.04842 0.03793 0.03138 0.0275
Cumulative Proportion  0.2776 0.4190 0.5204 0.60446 0.66947 0.71789 0.75582 0.78720 0.8147
[output]
#K-means clustering
mydataS <- scale(mydata)
asw <- numeric(5) #determining k
for (k in 2:5){
asw[[k]] <- pam(mydataS, k)$silinfo$avg.width
}
k.best <- which.max(asw)
k.best
[output]
2
[output]
fit <- kmeans(mydataS, k.best) #clustering
#Saving Results
p <- cbind(p, fit$cluster) #combine PCA coordinates and cluster labels
write.csv(p, file = "pca.csv") #saving
The following picture shows the k-means clustered data points plotted over PC1 & PC2 and PC1 & PC3. Unlike the past year, we scaled the data before selecting the number of clusters, which resulted in one fewer cluster. The algorithm separated the data into a main group (containing 84.1% of the data) and a secondary outlier group.
Here is the resulting file: pca.csv
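As a quick sanity check, the main-group share can be read straight off the k-means labels. The 116/22 split below is a hypothetical one, chosen only because it is consistent with the reported 84.1% of the 138 molecules.

```r
# Hypothetical cluster labels for 138 molecules (116 in the main group)
cl <- c(rep(1, 116), rep(2, 22))
round(100 * max(table(cl)) / length(cl), 1)  # 84.1
```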
Modeling
This is the first model we created. We used the randomForest function with its default settings, a 75% training sample, and a seed of 123.
library("randomForest")
mydata = read.csv(file="SolventInformationFeatureSelected.csv",head=TRUE,row.names="molID")
## 75% of the sample size
smp_size <- floor(0.75 * nrow(mydata))
## set the seed to make your partition reproducible
set.seed(123)
train_ind <- sample(seq_len(nrow(mydata)), size = smp_size)
train <- mydata[train_ind, ]
test <- mydata[-train_ind, ]
mydata.rf <- randomForest(deltaH ~ ., data = train,importance = TRUE)
print(mydata.rf)
[output]
Call:
randomForest(formula = deltaH ~ ., data = train, importance = TRUE)
Type of random forest: regression
Number of trees: 500
No. of variables tried at each split: 28
Mean of squared residuals: 88.37216
% Var explained: 83.95
[output]
saveRDS(mydata.rf, file = "EnthalpyTraining Set") #Stores File
varImpPlot(mydata.rf, sort = T, main="Variable Importance")#Displays Importance Plot
The actual R model is contained in this file: EnthalpyTraining Set. Here is the Variable Importance Plot. Again, nHBDon, ohs.sOH, and TopoPSA are the three most important variables, although their order has changed.
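The ranking shown in the plot can also be pulled out numerically with importance(). The snippet below demonstrates this on a small synthetic data set (toy variables x1 and x2, not the enthalpy descriptors), where the informative variable should rank first by %IncMSE.

```r
library(randomForest)
# Toy regression: x1 tracks y closely, x2 is pure noise
set.seed(123)
y <- rnorm(100)
toy <- data.frame(y = y, x1 = y + rnorm(100, sd = 0.1), x2 = rnorm(100))
rf <- randomForest(y ~ ., data = toy, importance = TRUE)
imp <- importance(rf)
# variables sorted by %IncMSE, most important first
ranked <- rownames(imp)[order(imp[, "%IncMSE"], decreasing = TRUE)]
ranked[1]  # "x1"
```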
#Crossvalidation
training.predict <- predict(mydata.rf,train) #Predicts train set
cluster.train <- fit$cluster[train_ind]
Training <- cbind(train, training.predict, cluster.train) #Combines all train data into one file
write.csv(Training, file = "RFTrainingSetPredict.csv") #Saves file
test.predict <- predict(mydata.rf,test) #Predicts test set
cluster.test <- fit$cluster[-train_ind]
Test <- cbind(test, test.predict, cluster.test) #Combines all test set into one file
write.csv(Test, file = "RFTestSetPredict.csv") #Saves file
Here are the files: RFTrainingSetPredict.csv, RFTestSetPredict.csv.
Our test set R^2 value is 0.887 and our test set RMSE is 7.0988. For our individual clusters:
Cluster 1 RMSE: 6.558
Cluster 2 RMSE: 9.737
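The reported R^2 and RMSE values come from comparing the saved predictions to the experimental values. The computation can be sketched as follows, with made-up numbers standing in for the actual columns of RFTestSetPredict.csv.

```r
# Toy stand-ins for the deltaH, prediction, and cluster columns
actual    <- c(-30.2, -25.7, -41.0, -18.9)
predicted <- c(-29.0, -27.1, -39.5, -20.0)
cluster   <- c(1, 1, 2, 2)
rmse <- sqrt(mean((predicted - actual)^2))
r2   <- 1 - sum((predicted - actual)^2) / sum((actual - mean(actual))^2)
# per-cluster RMSE, as reported for clusters 1 and 2
rmse_by_cluster <- tapply((predicted - actual)^2, cluster,
                          function(e) sqrt(mean(e)))
```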
Our file, with residuals and cross-validation statistics, is RFSetPredict.csv. Our chemical space with residuals is also shown below:
H2O Deep Learning Model
As an alternative to the random forest model, we built an H2O neural network to try to create an even stronger model. To compare the results directly with the RF model, we used the same train and test sets.
library(h2o)
h2o.init()
mydata = read.csv(file="SolventInformationFeatureSelected.csv",head=TRUE,row.names="molID")
## 75% of the sample size
smp_size <- floor(0.75 * nrow(mydata))
## set the seed to make your partition reproducible
set.seed(123)
train_ind <- sample(seq_len(nrow(mydata)), size = smp_size)
train <- mydata[train_ind, ]
test <- mydata[-train_ind, ]
#Prepare data for H2O
trainH2O<-as.h2o(train)
testH2O<-as.h2o(test)
Because the hidden-layer width strongly affects accuracy, we created ten models with the layer width varied from 100 to 1000 in steps of 100. This scan was repeated twice before selection, after which ten finer-grained widths (60 to 150, in steps of 10) were examined. It is important to note that the model was fine-tuned for the specific seed we were using.
residuals <- function(f,t){(f-t)^2} #squared residuals, used by the accuracy metrics below
R2 <- function(f,t){
tavg <- mean(t)
SSres <- sum(residuals(f,t))
SStot <- sum((t-tavg)^2)
output <- 1 - SSres/SStot
print(output)
}
for (i in 1:10){
hNeuralNet<-h2o.deeplearning(y="deltaH",
training_frame=trainH2O,
model_id="hNeuralNet",
hidden=c((100*i),(100*i)),
standardize=TRUE,
seed=123,
max_runtime_secs=300,
variable_importances=TRUE)
#cross validation
predictTrain<-predict(hNeuralNet,trainH2O)
predictTest<-predict(hNeuralNet,testH2O)
R2(predictTrain,trainH2O$deltaH)
R2(predictTest,testH2O$deltaH)
}
#Consistently high marks at 100
for (i in 1:10){
hNeuralNet<-h2o.deeplearning(y="deltaH",
training_frame=trainH2O,
model_id="hNeuralNet",
hidden=c((50+10*i),(50+10*i)),
standardize=TRUE,
seed=123,
max_runtime_secs=300,
variable_importances=TRUE)
#cross validation
predictTrain<-predict(hNeuralNet,trainH2O)
predictTest<-predict(hNeuralNet,testH2O)
R2(predictTrain,trainH2O$deltaH)
R2(predictTest,testH2O$deltaH)
}
A hidden-layer width of 130 was selected. The data were standardized (standardize=TRUE), and the seed was set to make the results reproducible. The final model is shown below:
hNeuralNet<-h2o.deeplearning(y="deltaH",
training_frame=trainH2O,
model_id="hNeuralNet",
standardize=TRUE,
seed=123,
hidden=c(130,130),
variable_importances=TRUE)
summary(hNeuralNet)
[output]
H2ORegressionMetrics: deeplearning
Reported on training data.
**Metrics reported on full training frame**
MSE: 27.54657
RMSE: 5.248483
MAE: 3.770966
RMSLE: NaN
Mean Residual Deviance : 27.54657
Variable Importances:
variable relative_importance scaled_importance percentage
1 nAtomP 1.000000 1.000000 0.013171
2 SPC.4 0.967459 0.967459 0.012743
3 khs.sssCH 0.964586 0.964586 0.012705
4 MLogP 0.961444 0.961444 0.012664
5 ATSm1 0.956220 0.956220 0.012595
---
80 nRotB 0.829468 0.829468 0.010925
81 SCH.7 0.827819 0.827819 0.010904
82 VCH.7 0.815953 0.815953 0.010747
83 C2SP2 0.801603 0.801603 0.010558
84 TopoPSA 0.798668 0.798668 0.010520
85 nAromBlocks 0.781421 0.781421 0.010292
[output]
#Saving information
predictTrain<-predict(hNeuralNet,trainH2O)
predictTest<-predict(hNeuralNet,testH2O)
resT <- residuals(predictTest,testH2O$deltaH)
resTr <- residuals(predictTrain,trainH2O$deltaH)
TEST <- cbind(as.data.frame(testH2O),as.data.frame(predictTest),as.data.frame(resT))
TRAIN <- cbind(as.data.frame(trainH2O),as.data.frame(predictTrain),as.data.frame(resTr))
#cluster information will be added in manually
write.csv(TEST, file="NNTestSetPredict.csv")
write.csv(TRAIN, file="NNTrainSetPredict.csv")
h2o.saveModel(object=hNeuralNet, path=getwd(), force=TRUE)
Here are the resulting files: hNeuralNet, NNTrainSetPredict.csv, NNTestSetPredict.csv, NNSetPredict.csv
Our test set R^2 value is 0.903, an increase of 1.8% relative to RF. Our test set RMSE is 6.601, a 7.0% reduction in error. For our individual clusters:
Cluster 1 RMSE: 5.13
Cluster 2 RMSE: 7.43
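The relative-improvement figures quoted above follow directly from the two models' test statistics. As a quick arithmetic check:

```r
# Test-set statistics reported for the RF and NN models
r2_rf <- 0.887;    r2_nn <- 0.903
rmse_rf <- 7.0988; rmse_nn <- 6.601
round(100 * (r2_nn - r2_rf) / r2_rf, 1)        # 1.8 (% increase in R^2)
round(100 * (rmse_rf - rmse_nn) / rmse_rf, 1)  # 7.0 (% reduction in RMSE)
```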
Our file, with cross-validation statistics, is NNTestSetCV.csv. Our chemical space with residuals is also shown below:
References
1. William E. Acree and Andrew SID Lang. Acree Enthalpy of Solvation Dataset. figshare. (2015). https://doi.org/10.6084/m9.figshare.1572326.v1